Using Twitter for Social Science Research

This online tutorial is designed to give you some first hands-on experience with accessing and analysing Twitter data for social science research.

The content covered here is by no means exhaustive; it offers a first taste of what you can do with data from the Twitter API.

What’s covered

In this tutorial you will learn how to:

  • Get developer access credentials to Twitter data

  • Use the rtweet package to query the Twitter API

  • Locate tweeters on a map

  • Collect different types of data on protest

  • Predict user type from account (bot yes/no)

Setup

Installing R

We will be using R in the exercises below and will assume some familiarity with the programming language.

If you haven’t used R before or would like to refresh your memory, a great point of reference is the book “R for Data Science” by Hadley Wickham and Garrett Grolemund.

You can install R by clicking here. We also recommend using RStudio for an easy-to-use development environment.

Connect to Twitter’s API

Below we will briefly describe how to obtain access to Twitter’s API. For a more detailed description of the authentication process, read the following vignette by Michael W. Kearney, the creator of the rtweet package.

To connect to Twitter’s API, you need a consumer key and a consumer secret, both of which you get by creating a Twitter App.

To create an app, you will first need to apply for a developer account. To do so, create a Twitter account or log into your existing one, and go to the Twitter developer portal.

Click on Apply in the navigation bar on the top right of the page, and fill in all the relevant information before submitting your application. Your application will then be reviewed by Twitter before access is granted. This might take hours or days.

Once you have acquired a developer account, navigate to developer.twitter.com/apps and select “Create a New App”. You will be asked to provide a name, a description, and the reasons of use for the app. This is Chris’ account, and you can see that he has registered several apps for different purposes.

After creating an app, you will be given a consumer key (or “API key”) and consumer secret (or “API secret key”), which you will use to interact with the Twitter API. You MUST make a record of these.

Select your Twitter app and click on the tab labelled “keys and tokens”. Click on “Create” to obtain your four keys, and copy and paste them into an R script file in order to create and permanently store your access token.

The create_token() function will save your access token as an environment variable for you. This way, the rtweet package will automatically find the token next time you try to access the Twitter API.
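As a minimal sketch of this step, the code below reads the four keys and builds a token with create_token(). It assumes an rtweet version that still provides create_token() (it is deprecated in later releases). The environment-variable names and the app name are hypothetical placeholders, and reading the keys from the environment rather than pasting them into the script is our own choice for illustration.

```r
# Hypothetical environment-variable names: store your own four keys under them,
# or replace the Sys.getenv() call with the strings copied from the developer portal.
keys <- Sys.getenv(c(
  "TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
  "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET"
))

# Only try to build a token if all four keys are present
have_keys <- all(nzchar(keys))

if (have_keys && requireNamespace("rtweet", quietly = TRUE)) {
  token <- rtweet::create_token(
    app             = "my_research_app",  # hypothetical app name
    consumer_key    = keys[["TWITTER_CONSUMER_KEY"]],
    consumer_secret = keys[["TWITTER_CONSUMER_SECRET"]],
    access_token    = keys[["TWITTER_ACCESS_TOKEN"]],
    access_secret   = keys[["TWITTER_ACCESS_SECRET"]]
  )
} else {
  message("Keys or rtweet not found; no token was created.")
}
```

If the keys are missing, the sketch only prints a message, so it is safe to run before your developer account has been approved.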

Once you have all of these keys and tokens recorded somewhere safe, you are ready to collect data!

Querying the Twitter API with rtweet

The rtweet package makes it very easy to collect and analyse Twitter data, including individual tweets as well as follower and friendship networks.

Let’s begin by collecting the most recent tweets mentioning the hashtag #BLM or #BlackLivesMatter. We are collecting 1,500 tweets here, but you can choose a higher or lower number. Note that, to return more than 18,000 tweets in a single call, you must set retryonratelimit = TRUE. Here, we have set include_rts = FALSE, meaning that all of our tweets are original tweets rather than retweets.
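The query described above could look roughly like the sketch below. It wraps rtweet's search_tweets() in a small helper so that nothing is sent to the API until you call it. The helper name collect_blm_tweets is hypothetical, and we assume a valid token has already been set up.

```r
# Hypothetical helper around rtweet::search_tweets(); defining it does not
# contact the API, calling it does (and requires a valid access token).
collect_blm_tweets <- function(n = 1500) {
  rtweet::search_tweets(
    q                = "#BLM OR #BlackLivesMatter",  # either hashtag
    n                = n,                            # number of tweets requested
    include_rts      = FALSE,                        # original tweets only
    retryonratelimit = n > 18000                     # wait out rate limits for large requests
  )
}

# blm_tweets <- collect_blm_tweets()   # uncomment once your token is in place
```

Setting retryonratelimit from n simply mirrors the 18,000-tweet rule above; you may prefer to set it by hand.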

Since many of you may not have obtained a developer account, we have uploaded a sample dataset to allow you to familiarise yourself with data from the Twitter API. To download the data, clone our GitHub repository, navigate to the downloaded folders, and run all code in the corresponding R Markdown file (01_analyse_twitter_data.Rmd).

Alternatively, you can download all datasets directly from the GitHub website. Note that you will have to prepend “https://raw.githack.com/ArunFrey/oss_twitter/main/” to all subsequent file paths to download the data, so we encourage working from within the cloned GitHub repo instead.
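If you do work from the website rather than the cloned repo, a small helper can take care of the prefix. This is only a sketch: the function name remote_path and the file name in the example call are hypothetical, so substitute the paths used in the tutorial code.

```r
# Base URL given in the tutorial text
base_url <- "https://raw.githack.com/ArunFrey/oss_twitter/main/"

# Hypothetical helper: turn a path inside the repo into a full download URL
remote_path <- function(path) {
  paste0(base_url, path)
}

# "data/example.csv" is a made-up file name used only for illustration
example_url <- remote_path("data/example.csv")
print(example_url)
```

You could then pass the result to whichever reading function matches the file type.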

The Twitter API allows us to retrieve a lot of information about tweets and users, but let’s stick with a few variables for now.

created_at | screen_name | text
2021-03-12 09:35:14 | Purge321 | Sorry, we don’t stand with your rhetoric. The moment your ALM hashtag came out was the turning point. #BlackLivesMatter #Asians4BlackLives https://t.co/CltI8nW2AH
2021-03-12 09:36:34 | vernon78784378 | @ASK_des MY @Labour of 1950’s destroyed. Replaced by #woke, #antisemitic, #antizionist, #antiwhite, #misandrist, #antipolice #BLM, #rejoiner economically illiterate factions with @Keir_Starmer’s front bench squabbling like fizzing lumps risen up in a cess pit @labouryouth @youngfabians https://t.co/1qoWiUG1kD
2021-03-12 09:38:37 | KatlynSkett | Keir starmer AKA NETANYAHU PUPPET with his blairites AKA LFI Are not only purging the left But have been openly conspiring against them since #corbyn Was elected leader #LabourLeaks shows that evidence #FreePalestine #corbynWasRight #StarmerOut #StarmerIsARacist #BLM https://t.co/NabTJnWXbf https://t.co/sGTAlBOXFE
2021-03-12 09:41:48 | SHAWNTAKAS | 🤣🤣Ugandan Police score a distinction in corruption. What would your country’s police score in your opinion?🤔 #WitsProtests #corruption #PoliceBrutality #BlackLivesMatter #Police https://t.co/QPFTVaeXcQ
2021-03-12 09:45:47 | culturereviewed | When Wits black students were fighting for the doors of learning to be open as the Freedom Charter promises, police responded with violence. Those are the “black crowd control” techniques they know. Nothing else #WitsProtest #BlackLivesMatter #CReview

These tweets all occur in quick succession (note that the tweets were collected in advance of the workshop, which is why the dates are not more recent). We can also visualise the frequency of the tweets over time. We could use the ggplot2 package for this, but rtweet already has a built-in function for visualising time series of Twitter data.
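To show what such a time series plot counts, here is a minimal base R sketch that bins tweet timestamps into five-minute intervals. The timestamps are the five created_at values shown in the table above; the five-minute bin width is our own choice for illustration.

```r
# The five created_at values from the sample rows above
created_at <- as.POSIXct(
  c("2021-03-12 09:35:14", "2021-03-12 09:36:34", "2021-03-12 09:38:37",
    "2021-03-12 09:41:48", "2021-03-12 09:45:47"),
  tz = "UTC"
)

# Count tweets per five-minute interval
counts <- table(cut(created_at, breaks = "5 min"))
print(counts)

# Simple bar chart of tweet frequency over time
barplot(counts, las = 2, ylab = "Number of tweets")
```

With rtweet loaded and a data frame of tweets returned by search_tweets(), a call such as ts_plot(tweets, by = "hours") should produce the corresponding plot directly.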

We can see that these 1,500 tweets capture only a small fraction of the online discussion surrounding Black Lives Matter on Twitter. What’s more, the plot reveals that users are more active during certain periods of the day.

Studying protests using Twitter

Next, we turn our attention to using Twitter data to study protest events, and focus on the 2017 Women’s March protests. Protests are notoriously hard to survey, and Twitter can potentially provide us with valuable insights into who is participating in a demonstration.

Below is a map of all geolocated tweets that were sent on January 21, 2017, the day of the Women’s March protest, showing that users across the world tweeted about the event.

We can convert these data into a SpatialPointsDataFrame, making sure the CRS is correctly defined, which allows us to plot the points easily using base R plotting functions. CRS stands for “Coordinate Reference System,” which controls the “projection” of the map we wish to visualize, i.e., how it looks. For more information on map projections, see this guide.

Manipulating the data in this way will be helpful when clipping to the boundaries of shapefiles as we go on to describe below.

Let’s begin by loading the data:

These data have been stripped of user-identifying information, including user name, bio, etc. Instead, we just have two columns: latitude and longitude. The points are from all tweets that contained the hashtag #WomensMarch. When we plot the latitude and longitude of the points, we can make out the vague outline of countries.
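The conversion to a SpatialPointsDataFrame could be sketched as follows, using the sp package. The three points are a hypothetical stand-in for the tutorial's data, and we assume the coordinates are plain longitude/latitude values on the WGS84 datum.

```r
library(sp)

# Hypothetical stand-in for the tutorial data: two columns, latitude and longitude
tweets_geo <- data.frame(
  latitude  = c(38.8895, 51.5074, -33.8688),
  longitude = c(-77.0353, -0.1278, 151.2093)
)

# Coordinates go in x (longitude), y (latitude) order
coords <- as.matrix(tweets_geo[, c("longitude", "latitude")])

# Assumed CRS: unprojected longitude/latitude on the WGS84 datum
points_sp <- SpatialPointsDataFrame(
  coords      = coords,
  data        = tweets_geo,
  proj4string = CRS("+proj=longlat +datum=WGS84")
)

# Base R plotting of the points
plot(points_sp, pch = 20, axes = TRUE)
```

Note the column order: a common mistake is to pass latitude first, which flips the map.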

In order to identify points within the routes of our targeted protest marches, we first read in three shapefiles. Each of these has been created by drawing a buffer of increasing size around the route of the 2017 Washington D.C. Women’s March.

We can compare these to the original march route below:

We can create these shapefiles with relative ease in open-source GIS software such as QGIS.

It is not strictly necessary to use these more accurate geographic representations of protest routes, however. In fact, a rectangular bounding box can capture these same protestors with only a limited loss of accuracy. To find the coordinates of a bounding box, we recommend using the open-source OpenStreetMap platform.

As shown below, by searching for a location in OpenStreetMap and selecting the “Export” option at the top of the window, we can view the coordinates of the upper-left and lower-right corners of the map displayed in the viewer window. You can zoom in and out on this map in order to select an appropriate geographical area.

To generate a rectangular bounding box object from these four coordinates, we simply need to combine them into a matrix. We can then convert this into a spatial object and assign the relevant CRS, the same one we assigned to our spatial points above.

We show the coordinates in the image above. From these coordinates, we can now easily generate a SpatialPolygons bounding box by combining the x1, y1, x2, and y2 coordinates into a matrix.
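A minimal sketch of this step with the sp package is given below. The corner values are hypothetical numbers roughly around central Washington D.C., not the coordinates used in the tutorial, so replace them with the values you read off OpenStreetMap.

```r
library(sp)

# Hypothetical corner coordinates; replace with the values from OpenStreetMap
x1 <- -77.05; x2 <- -77.00   # left and right longitude
y1 <- 38.88;  y2 <- 38.90    # bottom and top latitude

# Corners of the rectangle as a closed ring (first corner repeated at the end)
box_coords <- matrix(
  c(x1, y1,
    x1, y2,
    x2, y2,
    x2, y1,
    x1, y1),
  ncol = 2, byrow = TRUE
)

# Wrap the ring in the Polygon -> Polygons -> SpatialPolygons hierarchy
# and assign an (assumed) longitude/latitude WGS84 CRS
bbox_sp <- SpatialPolygons(
  list(Polygons(list(Polygon(box_coords)), ID = "bbox")),
  proj4string = CRS("+proj=longlat +datum=WGS84")
)

plot(bbox_sp, axes = TRUE)
```

The nesting looks verbose for a single rectangle, but it is the structure sp expects, and it lets you add further polygons under their own IDs later.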

We compare our bounding box shape to our route buffer shapes below.

We can see that our rectangular bounding box shape covers a larger area than the march route shapes. If we think this bounding box is too large, we can always reduce it in size by lifting new coordinates from OpenStreetMap, converting to a matrix, and generating a smaller spatial bounding box. For now, we will continue with the bounding box we have generated.

One of the challenges with Twitter data is that it is unclear whether someone who tweets about a protest actually participates in it. Information on the geo-location of users allows us to assess whether or not a user tweeted from within the protest march.
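For a rectangular bounding box, this check reduces to four comparisons. The sketch below uses hypothetical box corners and three made-up tweet locations; none of these numbers come from the tutorial data.

```r
# Hypothetical bounding box around central Washington D.C.
x1 <- -77.05; x2 <- -77.00   # left and right longitude
y1 <- 38.88;  y2 <- 38.90    # bottom and top latitude

# Made-up tweet locations: two in D.C., one in London
tweets_geo <- data.frame(
  longitude = c(-77.0353, -0.1278, -77.0200),
  latitude  = c(38.8895, 51.5074, 38.8900)
)

# TRUE where a tweet falls inside the box
in_box <- with(
  tweets_geo,
  longitude >= x1 & longitude <= x2 & latitude >= y1 & latitude <= y2
)

# Keep only the tweets sent from within the box
inside <- tweets_geo[in_box, ]
print(inside)
```

For irregular shapes such as the route buffers, a point-in-polygon function like sp::over() would take the place of the four comparisons.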

Using open-source GIS software in this way, we can locate individuals within the route of a protest march, which provides a more confident measure of participation.

Protest hashtags

We have also provided you with a sample dataset containing a subset of 500 users who tweeted from D.C. about the Women’s March on the day of the protest. We have changed the names and status ids of all tweets in the data, and have only uploaded information on a few key variables.

To load the dataset, run the following code:

status_id | screen_name | created_at | hashtags
04f8ecca36a6c773 | amusing_mule | Sat Jan 21 23:59:44 +0000 2017 | womensMarch
6b47bbc6904c6383 | evilminded_springpeeper | Sat Jan 21 23:59:22 +0000 2017 | Trump WomensMarch
4fd844a06013dbce | nonirrational_iceblueredtopzebra | Sat Jan 21 23:57:34 +0000 2017 | womensmarch
9b4d721dcdd8fbb4 | frigid_junebug | Sat Jan 21 23:56:13 +0000 2017 | womensmarch womensmarchonwashington
910e8cfc0a5597d2 | benignant_nabarlek | Sat Jan 21 23:55:28 +0000 2017 | womensmarchonwashington womensmarch

We can look at which hashtags were used most frequently during the march in DC. To do that, we use the hashtags variable, which lists the hashtags used in each tweet. To separate multiple hashtags into individual rows, we use the unnest_tokens function from the tidytext package, which also converts them to lower case by default. The plot below visualises all hashtags that were used at least 10 times during the 2017 Women’s March in DC.
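The counting step could be sketched as below, using the hashtags from the five sample rows shown above. The data frame name march_tweets is hypothetical, and we use a threshold of 2 rather than 10 because the toy data are so small.

```r
library(dplyr)
library(tidytext)

# Hashtags from the five sample rows; `march_tweets` is a hypothetical name
march_tweets <- data.frame(
  hashtags = c(
    "womensMarch",
    "Trump WomensMarch",
    "womensmarch",
    "womensmarch womensmarchonwashington",
    "womensmarchonwashington womensmarch"
  ),
  stringsAsFactors = FALSE
)

hashtag_counts <- march_tweets %>%
  # One row per hashtag; tokens are lower-cased by default
  unnest_tokens(output = hashtag, input = hashtags, token = "words") %>%
  # Frequency of each hashtag, most frequent first
  count(hashtag, sort = TRUE)

# Keep hashtags used at least twice (the tutorial uses 10 on the full data)
frequent <- filter(hashtag_counts, n >= 2)
print(frequent)
```

Because the tokens are lower-cased, womensMarch, WomensMarch and womensmarch are counted as the same hashtag.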

Estimating ideology

Once we have located our protestor-users, estimating their ideological position (based on their follow network) is straightforward using the tweetscores package by Pablo Barberá. We will not estimate the ideologies of our users above, as they have been anonymized. But you can certainly look at your own: simply change the user name to your own Twitter username. Note: you will also need the authentication token (here: my_oauth_CB) you created above to download the network of accounts that a user follows. For more information, follow the steps outlined by Barberá here.

If you want to estimate the ideology for multiple users, we suggest opting for Maximum Likelihood estimation, which is considerably faster. You can do this by using the estimateIdeology2 function (see here for the estimation functions for each of these two approaches).

Some other ways to use Twitter data

This is only the beginning of what we can do with Twitter data. The code below uses the tweetbotornot2 package by Michael Kearney, the author of the rtweet package, and predicts how many accounts in our Black Lives Matter tweets dataset are likely bots.

We plot a histogram of predicted likely bot accounts below, along with a short selection of some of the tweets.

screen_name | text
heyman_bot | #BLM
PortsideOrg | Amy Sherald Directs Her Breonna Taylor Painting Toward Justice #BreonnaTaylor #Louisville #policekillings #DefundthePolice #police #AfricanAmericanart #BLM #BlackLivesMatter https://t.co/HXUtAerPeS
culturereviewed | When Wits black students were fighting for the doors of learning to be open as the Freedom Charter promises, police responded with violence. Those are the “black crowd control” techniques they know. Nothing else #WitsProtest #BlackLivesMatter #CReview
say_the_names | Willie McCoy #BlackLivesMatter
say_the_names | Michael Dean #BlackLivesMatter
BLMProtestBot | If you’re reading this, remember that #BlackLivesMatter
say_the_names | John Crawford III #BlackLivesMatter
tbasharks | Wed Jul 29 2020 Portland, Oregon - Independent journalist arrested WATCH: https://t.co/3PUJqzKHfX #PortlandOregon #PPD #blacklivesmatter #blm #defundthepolice #abolishthepolice
tbasharks | Sat May 30 2020 Austin, Texas - Police shoot non-violent protester in the head WATCH: https://t.co/tMCKqCxjkp #AustinTexas #APD #blacklivesmatter #blm #defundthepolice #abolishthepolice
TheLivesThatMtr | Say their name. Derrick Lee Hunt, 2015-08-07 #BlackLivesMatter.